Jurilinguistic Engineering in Cantonese Chinese: An N-gram-based Speech to Text Transcription System

Authors

  • Benjamin Ka-Yin T'sou
  • King Kui Sin
  • Samuel W. K. Chan
  • Tom B. Y. Lai
  • Caesar Suen Lun
  • K. T. Ko
  • Gary K. K. Chan
  • Lawrence Y. L. Cheung
Abstract

A Cantonese Chinese transcription system that automatically converts stenograph code to Chinese characters is reported. The major challenge in developing such a system is the critical homocode problem caused by homonymy. A statistical N-gram model is used to compute the best combination of characters. Supplemented with a 0.85-million-character corpus of domain-specific training data and enhancement measures, the bigram and trigram implementations achieve 95% and 96% accuracy respectively, compared with 78% accuracy for the baseline model. The system's performance is comparable with that of other advanced Chinese Speech-to-Text input applications under development. The system meets an urgent need of the Judiciary of post-1997 Hong Kong.

Keywords: Speech to Text, Statistical Modelling, Cantonese, Chinese, Language Engineering
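As a rough illustration of the homocode disambiguation the abstract describes, the following is a minimal sketch of bigram-based decoding, not the authors' implementation. The stenograph codes (`FAT`, `TING`), the candidate table, the four-sentence corpus, and the add-one smoothing are all assumptions made for the example; the paper's actual code set, training data, and enhancement measures are not given on this page.

```python
import math
from collections import Counter

# Hypothetical toy training corpus (the real system used ~0.85M characters).
CORPUS = ["法庭", "法官", "發現", "法庭"]

# Hypothetical homocode table: one stenograph code maps to several characters.
CANDIDATES = {"FAT": ["發", "法"], "TING": ["停", "庭"]}


def train_bigrams(sentences):
    """Count character unigrams and bigrams, with a sentence-start marker."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        chars = ["<s>"] + list(sentence)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def log_prob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(cur | prev); smoothing choice is an assumption."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def decode(codes, candidates, unigrams, bigrams):
    """Viterbi search for the most probable character sequence for the codes."""
    vocab_size = len(unigrams)
    # best[ch] = (log score, path) of the best sequence ending in ch
    best = {"<s>": (0.0, [])}
    for code in codes:
        step = {}
        for cur in candidates[code]:
            score, path = max(
                (
                    (s + log_prob(prev, cur, unigrams, bigrams, vocab_size), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            step[cur] = (score, path + [cur])
        best = step
    return "".join(max(best.values(), key=lambda item: item[0])[1])


unigrams, bigrams = train_bigrams(CORPUS)
result = decode(["FAT", "TING"], CANDIDATES, unigrams, bigrams)
```

Each code contributes a column of candidate characters, and the search keeps only the best-scoring path into each candidate, so the cost grows linearly with the length of the input rather than with the number of possible character combinations. A trigram variant would condition on the two preceding characters instead of one.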

Similar resources

Court Stenography-To-Text ("STT") in Hong Kong: A Jurilinguistic Engineering Effort

Implementation of legal bilingualism in Hong Kong after 1997 has necessitated the production of voluminous and extensive court proceedings and judgments in both Chinese and English. For the former, Cantonese, a dialect of Chinese, is the home language of more than 90% of the population in Hong Kong and so used in the courts. To record speech in Cantonese verbatim, a Chinese Computer-Aided Trans...

Automatic Conversion from Phonetic to Textual Representation of Cantonese : The Case of Hong Kong Court Proceedings

The resumption of sovereignty over Hong Kong by China and the implementation of legal bilingualism there have given rise to an urgent need for producing verbatim court records of proceedings conducted in Cantonese, the predominant Chinese dialect spoken by the majority of the population. This has created a challenge to build up the jurilinguistic infrastructure vital for the full implementation...

The Specification of POS Tagging of the Hong Kong University Cantonese Corpus

The Hong Kong University Cantonese Corpus was collected from transcribed spontaneous speech in conversations and radio programs that involved two to four people. It was word-segmented, annotated with Cantonese pronunciation, and recently tagged with word classes by adopting the parts-of-speech (POS) scheme of Yu et al. (2002). This scheme, which was designed for tagging written Mandarin texts, e...

Language modeling for speech recognition of spoken Cantonese

This paper addresses the problem of language modeling for LVCSR of Cantonese spoken in daily communication. As a spoken dialect, Cantonese is not used in written documents and published materials. Thus it is difficult to collect sufficient amount of written Cantonese text data for the training of statistical language models. We propose to solve this problem by translating standard Chinese text,...

Automatic speech recognition of Cantones

This paper describes our recent work on the development of a large-vocabulary, speaker-independent, continuous speech recognition system for Cantonese-English code-mixing utterances. The details of both acoustic modeling and language modeling will be discussed. For acoustic modeling, Cantonese accents in English words are handled by applying cross-lingual acoustic units, as well as modifications...

Publication date: 2000